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Abstract 

The matrix completion problem consists of finding or approximating a low-rank matrix 
based on a few samples of this matrix. We propose a new algorithm for matrix completion 
that minimizes the least-square distance on the sampling set over the Riemannian manifold of 
fixed-rank matrices. The algorithm is an adaptation of classical non-linear conjugate gradients, 
developed within the framework of retraction-based optimization on manifolds. We describe all 
the necessary objects from differential geometry necessary to perform optimization over this low- 
rank matrix manifold, seen as a submanifold embedded in the space of matrices. In particular, 
we describe how metric projection can be used as retraction and how vector transport lets us 
obtain the conjugate search directions. Finally, we prove convergence of a regularized version of 
our algorithm under the assumption that the restricted isometry property holds for incoherent 
matrices throughout the iterations. The numerical experiments indicate that our approach 
scales very well for large-scale problems and compares favorably with the state-of-the-art, while 
outperforming most existing solvers. 

This report is the extended version of the manuscript \5J$ - It differs only by the addition of 
Appendix [3J 



1 Introduction 

Let A 6 ]SJ nxn be an m x n matrix that is only known on a subset £1 of the complete set of entries 
{1, . . . , m} x {1, . . . , n}. The low-rank matrix completion problem 16 consists of finding the matrix 
with lowest rank that agrees with A on S7: 

minimize rank(X), 

x • (1) 

subject to X £R mxn , P n (X) = P n (A), 

where 

{Xi if (i, j) G U, 
(2) 
o if(i,j)?n, 

denotes the orthogonal projection onto Q. Without loss of generality, we assume m < n. 
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Due to the presence of noise, it is advisable to relax the equality constraint in ([I]) to allow for 
misfit. Then, given a tolerance e > 0, a more robust version of the rank minimization problem 
becomes 

minimize rank(X), 

subject to X G M mxn , || P n (X) - P n (A)|| F < e, ' 



(3) 



where ||^||f denotes the Frobenius norm of X. 

Matrix completion has a number of interesting applications such as collaborative filtering, sys- 
tem identification, and global positioning, but unfortunately it is NP hard. Recently, there has been 
a considerable body of work devoted to the identification of large classes of matrices for which ([!]) 
has a unique solution that can be recovered in polynomial time. In 18 , for example, the authors 



show that when is sampled uniformly at random, the nuclear norm relaxation 



minimize 

x 



\X\ 



subject to X £ 



PfiPO =P n (A), 



(4) 



can recover with high probability any matrix A of rank k that has so-called low incoherence, 
provided that the number of samples is large enough, > Cnk polylog(n). Related work has 



been done by 16,17,30,31 , which in particular also establish similar recovery results for the robust 
formulation (|3]). 

Many of the potential applications for matrix completion involve very large data sets; the Netflix 
matrix, for example, has more than 10 8 entries [H] . It is therefore crucial to develop algorithms that 
can cope with such a large-scale setting, but, unfortunately, solving Q by off-the-shelf methods 
for convex optimization scales very badly in the matrix dimension. This has spurred a considerable 
amount of algorithms that aim to solve the nuclear norm relaxation by specifically designed meth- 
ods that try to exploit the low-rank structure of the solution; see, e.g., 15,24,38-41,53 . Other 



approaches include optimization on the Grassmann manifold [7 22 31 ; atomic decompositions 36 
and non-linear SOR 156] . 



1.1 The proposed method: optimization on manifolds 

We present a new method for low-rank matrix completion based on a direct optimization over the 
set of all fixed-rank matrices. By prescribing the rank of the global minimizer of ([3]), say k, the 
robust matrix completion problem is equivalent to 

minimize f(X) := ||| P n (X - A)\\ 2 F , 

x (5) 
subject to X £M k : = {X £ R mxn : rank(X) = k}. 

It is well known that A4k is a smooth (C°°) manifold; it can, for example, be identified as the 
smooth part of the determinantal variety of matrices of rank at most k 12, Proposition 1.1]. Since 
the objective function / is also smooth, problem (|5j) is a smooth optimization problem, which can be 
solved by methods from Riemannian optimization as introduced among st others by (2j|6j(23] . Simply 



put, Riemannian optimization is the generalization of standard unconstrained optimization, where 
the search space is M n , to optimization of a smooth objective function on a Riemannian manifold. 

For solving the optimization problem ^ , in principle any method from optimization on Rieman- 
nian manifolds could be used. In this paper, we use a generalization of classical non-linear conjugate 
gradients (CG) on Euclidean space to perform optimization on manifolds; see, e.g., [2 , 23 , 52 . The 
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reason for this choice compared to, say, Newton's method was that CG performed best in our nu- 
merical experiments. The skeleton of the proposed method, LRGeomCG, is listed in Algorithm [TJ 
Algorithm [T] is derived using concepts from differential geometry, yet it closely resembles a typi- 
cal non-linear CG algorithm with Armijo line-search for unconstrained optimization. It is schemat- 
ically visualized in Figure [T] for iteration number i and relies on the following crucial ingredients 
which will be explained in more detail for (|5j) in Section [2] 

1. The Riemannian gradient, denoted grad/(Xj), is a specific tangent vector £j which corre- 
sponds to the direction of steepest ascent of f(Xi), but restricted to only directions in the 
tangent space Tx 4 Alfc. 

2. The search direction rji S TjqAI/c is conjugate to the gradient and is computed by a variant 
of the classical Polak-Ribiere updating rule in non-linear CG. This requires taking a linear 
combination of the Riemannian gradient with the previous search direction rji-\. Since 
does not lie in Tj^A^fc, it needs to be transported to T^Alfe. This is done by a mapping 
Txi-i-^Xi ■ Tx i _ 1 M-k Txi-Mk, the so-called vector transport. 

3. As a tangent vector only gives a direction but not the line search itself on the manifold, a 
smooth mapping Rx z ■ Tx i M.k M-k, called the retraction, is needed to map tangent vectors 
to the manifold. Using the conjugate direction rji, a line-search can then be performed along 
the curve t \-t Rxi(tr]i). Step 6 uses a standard backtracking procedure where we have 
chosen to fix the constants. More judiciously chosen constants can sometimes improve the 
line search, but our numerical experiments indicate that this is not necessary in the setting 
under consideration. 



Algorithm 1 LRGeomCG: geometric CG for (5) 



# see Algorithm [2] 



Require: initial iterate X\ G Alfc, tolerance r > 0, tangent vector r]o = 
1: for i = 1, 2, ... do 
2: Compute the gradient 

& :=grad/(A 4 ) 
3: Check convergence 

if < t then break 
4: Compute a conjugate direction by PR+ 

m ■= + PiTxi^XiiVi-l) 
5: Determine an initial step ti from the linearized problem 

U = arg min t f(Xi+tr]i) 
6: Perform Armijo backtracking to find the smallest integer m > such that 

f(Xi) - f(R Xi (0.5 m t im )) > -0.0001 x 0.5 m U 
and obtain the new iterate 

X t+l := R Xl (0.5 m U m) # see Algorithm © 

7: end for 



# see All 

# see Al; 



forithm |4] 
$orithm [5] 



1.2 Relation to existing manifold-related methods 

At the time of submitting the present work, a large number of other matrix completion solvers 



based on Riemannian optimization have been proposed in [10 , 11 21 30, 31 , 42, 44 46 51 . Like 
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PiTxi^Xiim-i) 

Vi 




T Xl Mk 



X i+1 = R Xi {V.h m t i r H ) 



Figure 1: Visualization of Algorithm [TJ non-linear CG on a Riemannian manifold. 



the current paper, all of these algorithms use the concept of retraction-based optimization on the 
manifold of fixed-rank matrices but they differ in their specific choice of Riemannian manifold 
structure and metric. It remains a topic of further investigation to assess the performance of these 
different geometries with respect to each other and to non-manifold based solvers. 

A first attempt at such a comparison has been very recently done in [47] where the previous 
geometries were tested to complete one large matrix. The authors reach the conclusion that for 
gradient-based algorithms all these geometries perform remarkably more or less the same with 

the GH T geometry of 



47 



42 



respect to the total computational time. For the particular setup of 
turned out to result in the most efficient solver, but overall most geometries — including ours — 
outperformed the state-of-the art. In addition, Newton-based algorithms performed well when 
high precision was required and now the algorithm based on our embedded submanifold geometry 
was clearly faster. 



1.3 Outline of the paper 

The plan of the paper is as follows. In the next section, the necessary concepts of differential 
geometry are explained to turn Algorithm [T] into a concrete method. The implementation of 
each step is explained in Section [3j We also prove convergence of a slightly modified version of 
this method in Section [4] under some assumptions which are reasonable for matrix completion. 
Numerical experiments and comparisons to the state-of-the art are carried out in Section [5j The 
last section is devoted to conclusions. 



2 Differential geometry for low-rank matrix manifolds 

In this section, we explain the differential geometry concepts used in Algorithm [T] applied to our 
particular matrix completion problem ((5j). 
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2.1 The Riemannian manifold 



Let 

M k = {l£ R mxn : rank(X) = k] 
denote the manifold of fixed-rank matrices. Using the SVD, one has the equivalent characterization 

M k = {UZV T : U G St£\ V G StjJ, S = diag(^), <r x > ■ ■ ■ > a k > 0}, (6) 

where St™ is the Stiefel manifold of m x k real, orthonormal matrices, and diag(<7j) denotes a 
diagonal matrix with <jj on the diagonal. Whenever we use the notation UT,V T in the rest of the 
paper, we always mean matrices that satisfy ([6]). Furthermore, the constants k,m,n are always 
used to denote dimensions of matrices and, for simplicity, we assume 1 < k < m < n. 

The following proposition shows that Mk is indeed a smooth manifold. While the existence of 
such a smooth manifold structure, together with its tangent space, is well known (see, e.g., 28|32 for 
applications of gradient flows on Mk), more advanced concepts like retraction-based optimization 
on Mk have only very recently been investigated; see 50,51 . In contrast, the case of optimization 
on symmetric fixed-rank matrices has been studied in more detail in 29,43,48,55]. 

Proposition 2.1 The set Mk is a smooth submanifold of dimension (m + n — k)k embedded in 
R mxn . Its tangent space T x Mk at X = UT,V T G Mk is given by 



TxMy 



[u u ± ] 



5 (m— k) X k 



okx(n—k) 







(m—k) x (n—k) 



[v v4 



{UMV T + U P V T + UVjf : M G R kxk , 



t/ p G 



v 

omxk 



(7) 
(8) 



U = 0, V p G 



i nxk , vjv 



0}. 



Proof. See |35[ Example 8.14] for a proof that uses only elementary differential geometry based on 
local submersions. The tangent space is obtained by the first-order perturbation of the SVD and 
counting dimensions. □ 
The tangent bundle is defined as the disjoint union of all tangent spaces, 



TM k .= [J {X} x TxMk = {(X,£) G 
xeM k 



': X £ Mk, £ G TxMk}- 



By restricting the Euclidean inner product on M mxn , 

(A, B) = tx{A T B) with A,B £ R mxn , 
to the tangent bundle, we turn Mk into a Riemannian manifold with Riemannian metric 

9x(t V) ■= (£, V) = tr (£ T ?7) with X G Mk and £, r) G T x M k , (9) 

where the tangent vectors £,ry are seen as matrices in M mxn . 

Once the metric is fixed, the notion of the gradient of an objective function can be introduced. 
For a Riemannian manifold, the Riemannian gradient of a smooth function f : Mk — > 1 at 1 £ Mk 
is defined as the unique tangent vector grad/(X) in TxMk such that 



<grad/pO,0 = D /(*)[£] for all £ G T x M k , 
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where we denoted directional derivatives by D /. Since A4 k is embedded in IR mxn , the Riemannian 
gradient is given as the orthogonal projection onto the tangent space of the gradient of / seen as a 
function on R mxn ; see, e.g., § (3.37)]. Defining P v := UU T and Pjj := I - P v for any U G St£\ 
we denote the orthogonal projection onto the tangent space at X as 

P TxMk :R mxn ^T x M k , Z^PuZPy + P^ZPy + PuZP^. (10) 

Then, using Pq(X — A) as the (Euclidean) gradient of f(X) = \\ Pq(X — A)\\ F /2, we obtain 

grad/(X) := P TxMk (P Q (X - A)). (11) 

2.2 Metric projection as retraction 

As explained in Section we need a so-called retraction mapping to go back from an element in 
the tangent space to the manifold. In order to prove convergence within the framework of |2j, this 
mapping has to satisfy certain properties such that it becomes a first-order approximation of the 
exponential mapping on A4 k . 

Definition 2.2 (Retraction |3, Definition 1]) A mapping R: TA4 k — > A4 k is sa id to be a re- 
traction on A4 k if, for every X G A4 k , there exists a neighborhood U around (X,0) C TAij. such 
that the following holds. 

(1) U C dom(ii) and R\u' U -Mk i s smooth. 

(2) R(X, 0)=X for all (X, 0) G U. 

(3) D R(X, 0) [0, = £ for all (X, £) G U. 

We also introduce the following shorthand notation: 

R x :T x M k ^M k , £^R(X,0. 

In our setting, we have chosen metric projection as retraction: 

R x :U x ^M k , i^P Mk {X + i) :=argmin||X + e-^||F, (12) 

zeM k 

where Ux C TxM. k is a suitable neighborhood around the zero vector, and Pjvi k is t ne orthogonal 
projection onto A4 k . Based on |37[ Lemma 2.1], it can be easily shown that this map satisfies the 
conditions in Definition 12. 21 

The retraction as metric projection can be computed in closed-form by the SVD: Let X G A4 k 
be given, then for sufficiently small £ G Ux C TxA4 k , we have 

k 

Rx(0 = PM k (X + = /2 (J i u i v ?i ( 13 ) 

i=l 

where o~i,Ui,Vi are the (ordered) singular values and vectors of the SVD of X + £. It is obvious 



that when a k = there is no best rank-A: solution for (12), and additionally, when o~ k -i = o k the 



minimization in (12) does not have a unique solution. In fact, since the manifold is non-convex, 



metric projections can never be well defined on the whole tangent bundle. Fortunately, as indicated 



in Definition 2.2, the retraction has to be defined only locally because this is sufficient to establish 



convergence of the Riemannian algorithms. 
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2.3 Riemannian Newton on M. k 

Although we do not exploit second-order information in Algorithm [TJ Newton or its variants may 



be preferable for some applications. In 47 , for example, it was shown that the second-order 
Riemannian trust-region algorithm of [I] performs very well in combination with our embedded 
submanifold geometry. We will therefore give the following theorem with an explicit expression of 
the Riemannian Hessian of /. Its proof is a straightforward but technical analog of that in |55| for 
symmetric fixed-rank matrices. It can be found in Appendix [A} 

Proposition 2.3 For any X = UYy T G A4 k satisfying Q and £ G T x Ai k satisfying Q, the 
Riemannian Hessian of f at X in the direction of £ satisfies 

Hess f{X)[£] = Pu P n (0 Py + Pf> [P n (0 + ?n(X - A)V^T X V T ] P v 



+ Pi/(P n (0 + WrWPaiX - A)) P 



v ■ 



2.4 Vector transport 



Vector transport was introduced in [2] and 49 as a means to transport tangent vectors from one 
tangent space to another. In a similar way as retractions are approximations of the exponential 
mapping, vector transport is the first-order approximation of parallel transport, another important 
concept in differential geometry. 

The definition below makes use of the Whitney sum, which is the vector bundle over some space 
obtained as the direct sum of two vector bundles over that same space. In our setting, we can define 
it as 

TMk © TMk = (J {X} x T x Mk x T x M k 

xeM k 

= {(X,rj,0:X e Mk,i G T x M k , V e T x M k }. 

Definition 2.4 (Vector transport |49, Definition 2]) A vector transport on the manifold M k 
is a smooth mapping: TA4 k © TM k —> TM k , (X, 77, £) 1— > 7^(£) satisfying the following properties 
for all X G M k . 

(1) There exists a retraction R such that, for all (X, 77, £) G TA4 k ®TM k , it holds that 7^(£) G 
T Rx ( v )M k . 

(2) %(0=Uor alli^T x M k . 

(3) For all (X, 77) G TM. k , the mapping T v : T x M. k — > T Rx ^M. k , £ h-> 7^(£) is linear. 

By slight abuse of notation, we will use the following shorthand notation that emphasizes the 
two tangent spaces involved: 

T X ^ Y : T x M k -> T Y M k , £ ^ T R -i (Y) (0, 



where T R -i^ Y ) uses t ne notation of Definition 



2.4 



See also Figure [2] for an illustration of this 
definition. By the implicit function theorem, R x is locally invertibTe for every X G Ai k but 
R x (Y) does not need to exist for all Y G M. k . Hence, T x ^y has to be understood locally, i.e., for 
Y sufficiently close to X. 
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Figure 2: Vector transport on a Riemannian manifold. 

Since A4/% is an embedded submanifold of M. mxn , orthogonally projecting the translated tangent 
vector in R mxrt onto the new tangent space constitutes a vector transport; see [2J Section 8.1.2]. 
In other words, we get 

Tx^Y :TxMk^TYM k ,^P TY M k (0, (14) 

with PT Y M k as defined in ( [To] ), 

As explained in Section the conjugate search direction rji in Algorithm [T] is computed as a 
linear combination of the gradient and the previous direction: 

j]i = -grad/(Zj) +ft7x l _ 1 ^x ! (??i-i), 

where we transported For we have chosen the geometrical variant of Polak-Ribiere (PR+), 
as introduced in 12} Chapter 8.2]. Again using vector transport, this becomes 

= (grad/pfi), gmdfjX^-Tx^xA^dfjX^))) 
P% (grad/(X J _ 1 ),grad/(^_ 1 )) ' 1 J 

In order to prove convergence, we also enforce that the search direction rji is sufficiently gradient- 
related in the sense that its angle with the gradient is never too small; see, e.g., [8j Chap. 1.2]. 

3 Implementation details 

This section is devoted to the implementation of Algorithm [TJ For most operations, we will also 
provide a flop count for the low-rank regime, i.e., k <C m < n. 

Low-rank matrices Since every element X £ Ai^ is a rank k matrix, we store it as the result 
of a compact SVD: 

X = UY,V T , 

with orthonormal matrices U G St™ and V G St^ - , and a diagonal matrix £ 6 M fcxfc with decreasing 
positive entries on the diagonal. For the computation of the objective function and the gradient, we 
also precompute the sparse matrix := Pq(X). As explained in the next paragraph, computing 
Xq costs (m + 2|r2|)/c flops. 



8 



Projection operator Po. During the course of Algorithm [TJ we require the application of Pn, 
as defined in Q, to certain low-rank matrices. By exploiting the low-rank form, this can be done 
efficiently as follows: Let Z be a vank-kz matrix with factorization Z := Y\Y^ (kz is possibly 
different from k, the rank of the matrices in Aik)- Define Zq := Pq(Z). Then element of Zq 
is given by 

( Zn ), ) = ( Efiri<U)K2(j ' ,) » W)6ft (10, 

The cost for computing Zq is 2|r2|fcz flops. 

Tangent vectors A tangent vector r] G TxMk at A = UTiV T 6 JKt will be represented as 

n = UMV T + U P V T + UV p T , (17) 

where M G R kxk , U p G M. mxk with U T U P = 0, and V p G M nxfc with F T V^ = 0. After vectorizing 
this representation, the inner product (rj,v) can be computed in 2(m + n)k + 2/c 2 flops. 

Since X is available as an SVD, the orthogonal projection onto the tangent space TxMk becomes 

V TxMk (Z) := Pu Z P v +{Z - P v Z) P v + P V (Z T - P v Z T ) T , 

where Pu := UU T and Py '■= VV T . If Z can be efficiently applied to a vector, evaluating and 



storing the result of PT x M k (Z) as a tangent vector in the format (17) can also be performed 
efficiently. 



Riemannian gradient From (11), the Riemannian gradient at X = UY,V T is the orthogonal 
projection of Pq(A — A) onto the tangent space at X. Since the sparse matrix ^ = Pq(A) is 
given and Xq = Pq(X) was already precomputed, the gradient can be computed by Algorithm [2j 
The total cost is 2(n + 2m)k 2 + 4\$l\k flops. 

Algorithm 2 Calculate gradient grad/(A) 

Require: matrix X = UY 1 V T G Mk, sparse matrix R = Xq — Aq G R mxn . 

Ensure: grad/(A) = UMV T + U P V T + UV p T G T x M k 
1-.R U ^R T U, R V ^RV ' #4|0|A: flops 

2: M U T R V # 2mk 2 flops 

3: U p ^ Ry- UM, V p ^ R u - VM T # 2(m + n)A; 2 flops 



Vector transport Let i/ = C/ MV T + C/ p y T + C/V T be a tangent vector at X = UYy T G A^/; 



By (14), the vector 

u + := Tx^x+{v) = P Tx+ M k { u ) 

is the transport of v to the tangent space at some X + = . It is computed by Algorithm [3] 

with a total cost of approximately 14(m + n)k 2 + 10/c 3 flops. 

Non- linear CG The geometric variant of Polak-Ribiere is implemented as Algorithm [4] costing 
about 28(m + n)k 2 + 20/j 3 flops. In order to improve robustness, we use the PR+ variant and 
restart when the conjugate direction is almost orthogonal to the gradient. 
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Algorithm 3 Calculate vector transport Tx -*x + {y) 



Require: matrices X = UTy T £ Mk and X + = G Mk, 

tangent vector v = UMV T + U p V T + UVj . 



Ensure: T x ^x + (") = U+M+V? + U p+ V? + U+V^ eT x+ M k 



A v 
B v 

M 

M 

M 



(i) 
4- 
(2) 
+- 
(3) 



6: M 4 



7: U, 



P+ 



v,_ 



V T V+ 

v p T v + 



A u 
B„ 



A T U MA % 
B T A 



U 

u 



u 

(2) 

+ 

(3) 



U T U- 

(i) 



U(MA V ), 

(2) 



(i) 



V{M T A U ) 



UpA v , V, 



UB V , V 



(3) 



P+ 



m| 1} + m| 2) + m| 3) 
C7 w + c/ (2) + [7 (3 )) 

F « + F p ) + y _(3) 5 



u„ 



V, 



-VB U 

-u + (ulu p+ ) 

-V+(V?V P+ ) 



# 2(m + n)A; 2 

# 2(m + n)A; 2 

# 6A; 3 + 2(m + n)/c 2 

# 2/c 3 + 2(m + ra)/c 2 

# 2fc 3 + 2(m + n)K 

#2k 2 

# 4mfc 2 

# 4nfc 2 



Algorithm 4 Compute the conjugate direction by PR+ 



# apply Algorithm 

# apply Algorithm 



Require: previous iterate previous gradient previous direction rji 

current iterate X^, current gradient & 
Ensure: conjugate direction r/j G T Xi Mk 

1: Transport previous gradient and direction to current tangent space: 

% <- 73f i _ 1 ->x i (?7i-i) 
2: Compute conjugate direction: 

<~ 6 - f i 
/3^max(0,(<^&}/(&-i,&-i}) 
r]i < & + 

3: Compute angle between conjugate direction and gradient: 

a <- {rihiil/^/im^i) 
4: Reset to gradient if desired: 
if a < 0.1 then r\i £j 



Initial guess for line search Although Algorithm [T] uses line search, a good initial guess can 
greatly enhance performance. We observed in our numerical experiments that an exact minimiza- 
tion on the tangent space alone (so, neglecting the retraction), 

mm f(X + t V ) = ^mm\\P n (X)+tP n ( V )-P n (A)\\ 2 F , (18) 
performed extremely well as initial guess in the sense that backtracking is almost never necessary. 



Equation (18) is essentially a one-dimensional least-square fit for t on $7. The closed-form 



solution of the minimizer t* satisfies 

U = < Pnfa), Pn(A - X) } / { Pnfo), Pnfa) ) 

and is unique when r/ ^ 0. As long as Algorithm [T] has not converged, r/ will always be non-zero 
since it is the direction of search. The solution to ( |18[ ) is performed by Algorithm [5| In the actual 
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implementation, the sparse matrices N and B are stored as the non-zero entries on O. Hence, the 
total cost is about 2mk 2 + 4|f2|(fc + 1) flops. 



Algorithm 5 Compute the initial guess for line search = arg min 4 f(X +trf) 

Require: iterate X = UT>V T and projection Xq, tangent vector r\ = UMV T + U P V T + UV p T , 

sparse matrix R = Aq - Xq G R mxn 
Ensure: step length t* 

1: N<-P n (\UM + U p U][V V P ] T ) # Irak 2 + A\Q\k flops 

2: U 4- tx(N T R)/tr(N T N) ' # 4|fi| flops 



Retraction As shown in ( 13 ), the retraction ( 12 ) can be directly computed by the SVD of X + £. 
A full SVD of X + £ is, however, prohibitively expensive since it costs 0(n 3 ) flops. Fortunately, 
the matrix to retract has the particular form 



x + i=[u u p 



h o 



[v v p 



with X = UY>V T G M k and £ = UMV T + C/ p y T + UV? G Ty-Mfc. Algorithm g now performs a 
compact QR and an SVD of a small 2/c-by-2/c matrix to reduce the flop count to 14(m + n)k 2 + 
Csvd^ 3 when k <C min(m, n). 

Observe that the listing of Algorithm [6] uses Matlab notation to denote common matrix op- 
erations. In addition, the flop count of computing the SVD in Step 3 is given as Csvd& 3 since it 
is difficult to estimate beforehand in general In practice, the constant Csvd is modest, say, less 
than 200. Furthermore, in Step 4, we have added e mac h to £ s so that, in the unlucky event of zero 
singular values, the retracted matrix is perturbed to a rank- A; matrix in Mk- 



Algorithm 6 Compute the retraction by metric projection 
Require: iterate X = UT,V T , tangent vector £ = UMV T + U p V T + UV^ 
Ensure: retraction R x (0 = ?M k {X + €) = U + T, + V^ 



1: 


(Qu,R u ) <- qr(L7p,0), (Q V ,R V )<- 


-qr(V p ,0) 


# 10(m + n)k 2 flops 


2: 




£ + M R T V 
Ru 








3: 




s,V 8 ) <-svd{S) 




# Csvd A; 3 flops 


4: 




S s (l :k,l:k) + e mac h 






5: 


U+<- 


[U Qu]U s ( 


:,l:k), V + i 


- [V Q v ]V s (:,l:k) 


# 4(m + n)k 2 flops 



Computational cost Summing all the flop counts for Algorithm [TJ we arrive at a cost per 
iteration of approximately 

(48m + Un)k 2 + 10|n|fc + nb A rmijo (8(m + n)k 2 + 2\Q\k), 

where nbArmijo denotes the average number of Armijo backtracking steps. It is remarkable, that in 
all experiments below we have observed that nbArmijo = 0, that is, backtracking was never needed. 
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For our typical problems in Section[5j the size of the sampling set satisfies |0| = OS (m + n — k)k 
with OS > 2 the oversampling factor. When m = n and nbArmijo = 0, this brings the total flops 
per iteration to 

theoretical = (92 + 20 OS)nk 2 . (19) 

Comparing this theoretical flop count with experimental results, we observe that sparse matvecs 



and applying Pq require significantly more time than predicted by (19). This can be explained 
due to the lack of data locality for these sparse matrix operations, whereas the majority of the 
remaining time is spent by dense linear algebra, rich in BLAS3. After some experimentation, we 
estimated that for our Matlab environment these sparse operations are penalized by a factor 
of about C sparse ~ 5. For this reason, we normalize the theoretical estimate (19) to obtain the 
following practical estimate: 

iJS CG = (92 + C Bpaxse 20 OS)nk 2 , C sparsc ~ 5. (20) 

For a sensible value of OS = 3, we can expect that about 75 percent of the computational time will 
be spent on operations with sparse matrices. 

Comparison to existing methods The vast majority of specialized solvers for matrix com- 
pletion require computing the dominant singular vectors of a sparse matrix in each step of the 
algorithm. This is, for example, the case for approaches using soft- and hard-thresholding, like 



in [10 , 15 24 38 41 46 53 . Typically, PRO PACK from 34 is used for computing such a truncated 
SVD. The use of sparse matrix- vector products and the potential convergence issues frequently 
makes this the computationally most expensive part of all these algorithms. 

On the other hand, our algorithm first projects any sparse matrix onto the tangent space 
and then computes the dominant singular vectors for a small 2k-by-2k matrix. Apart from the 
application of P^ and a few sparse matrix-vector products, the rest of the algorithm solely consists 
of dense linear algebra operations. This is more robust and significantly faster than computing 
the SVD of a sparse matrix in each step. To the best of our knowledge, LMAFit [56] and other 
manifold- related algorithms like 7,22, 30 , 3~T||43] are the only competitive solvers where mostly dense 
linear algebra together with a few sparse matrix- vector products are sufficient. 

Due to the difficulty of estimating practical flop counts for algorithms based on computing 
a sparse SVD, we will only compare the computational complexity of our method with that of 
LMAFit. An estimate similar to the one of above, reveals that LMAFit has a computational cost 
of 

^practical = (^^ ^sparse 12 OS)n/c , C S p arsc ~ 5, (21) 

where the sparse manipulations involve 2k sparse matvecs and one application of Pq to a rank k 
matrix. 

Comparing this estimate to ( |20[ ), LMAFit should be about two times faster per iteration when 
OS = 3. As we will see later in Tables [T] and [2j this is almost exactly what we observe experimentally. 
Since LRGeomCG is more costly per iteration, it will have to converge much faster in order to 
compete with LMAFit. This is studied in detail in Section [5} 

4 Convergence for a modified cost function 

In this section, we show the convergence of Algorithm [T] applied to a modified cost function. 
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4.1 Reconstruction on the tangent space 

As first step, we apply the general convergence theory for Riemannian optimization in |2i]. In 
particular, since Rx is a smooth retraction, the Armijo-type line search together with the gradient- 
related search directions, allows us to conclude that any limit point of Algorithm [I] is a critical 
point. 

Proposition 4.1 Let Xi be an infinite sequence of iterates generated by Algorithm^ Then, every 
accumulation point A* of Xi satisfies Pr x ,M k Pn(A*) = Pt* Pfi(A). 

Proof. From Theorem 4.3.1 in j2j, every accumulation point is a critical point of the cost function 
/ of Algorithm [I] These critical points A* arc determined by gvadf(X) = 0. By (11) this gives 
?T x , Mk Pn(X* - A) = 0. □ 

The next step is establishing that there exist limit points. However, since is open — its 
closure are the matrices of rank bounded by k — it is not guaranteed that any infinite sequence has 
indeed a limit point in Mk- To guarantee that there exists at least one such point in the sequence 
of iterates of Algorithm [TJ we want to exploit that these iterates stay in a compact subset of Mk- 

It does not seem possible to show this for the objective function / of Algorithm [T] without 
modifying the algorithm or making further assumptions. We therefore show that this property 
holds when Algorithm^ is applied not to f but the regularized version 

g:M^R,X^ f(X) + ^(\\X^f F + \\X\\ 2 F ), p > 0. 

Since every X £ Mk is of rank k, the pseudo- inverse is smooth on Mk and hence g(X) is also 
smooth. Using 26 Thm. 4.3], one can show that the gradient of X i— > ||A'||p + ||A||p equals 



2t/(£ - T,~ 3 )V T for X = UT,V T ; hence, gradc/(A) can be computed very cheaply after modifying 
step 2 in Algorithm [2j 

Proposition 4.2 Let Xi be an infinite sequence of iterates generated by Algorithm^ but with the 
objective function g(X) = /(A)+^ 2 (||X"''|||,-|-||A|||,) forsomeO < fi < 1. Then, lim^oo Pt x M k Pc(Aj 

A) = 2^(x} + Xi). 

Proof. We will first show that the iterates stay in a closed and bounded subset of Mk- Let 
L = {X £ Mk '■ d(X) < g(Xo)} be the level set at Ao. By construction of the line search, all Xi 
stay inside L and we get 

1 



9 Pn(Aj — A)||p + /r||X/||p + /i ||Aj|| F < C$, i > 0, 



-tl|2 

" ^IIF T H- IK 1 

where Cq := g(Xo). This implies 



M 2 M! < Co - ^11 Pn(Xi - A)f F - fi 2 \\xj\\ F < Cl 
and we obtain an upper bound on the largest singular value of each Xf. 

o-i(Xi)<\\X t \\ F <C /ii=:C a . 
Similarly, we get a lower bound on the smallest singular value: 

k .2 



Ml|AT|| F = ^ < Ci - - || Po(A, - A)\\ F - v*\\X^ < Ct 
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which implies that 

(Tk(Xi) > fi/Co =: C a . 
It is clear that all Aj stay inside the set 

B = {X £ Mk ■ ai (X) < C a , a k {X) > C a }. 

This set is closed and bounded, hence compact. 

Now suppose that lim^oo || gradg(Aj)||F / 0. Then there is a subsequence in {Xj}, g x: such 
|| grad g(X)||F > e > for alH G IC. Since Xi G B, this subsequence {Aj}j g £ has a limit point X* in 
-B. By continuity of gradg, this implies that || grad <7(A„,)||f > £ which contradicts Theorem 4.3.1 
in [2] that every accumulation point is a critical point of g. Hence, lim^oo || gradg(Aj)||F = 0. 
□ 

In theory, /i 2 can be very small, so one can choose it arbitrarily small, for example, as small as 
£mach — 10~ 16 . This means that as long as <7i(Aj) = O(l) and a^Xi) ^> e mac h, the regularization 
terms in g are negligible and one might as well have optimized the original objective function /, 
that is, g with /x = 0. This is what we observed in all numerical experiments when rank(^4) > k. 
In case rank(j4) < k, it is obvious that now <7fc(Aj) — > as i — > oo. Theoretically one can still 
optimize g, although in practice it makes more sense to transfer the original optimization problem 
to the manifold of rank k — 1 matrices. Since these rank drops do not occur in our experiments, 
proving convergence of such a method is beyond the scope of this paper. 

A similar argument appears in the analysis of other methods for rank constrained optimization 
too. For example, the proof of convergence in 14 for SDPLR |13| uses a regularization by /j, det(S) 



with S containing the non-zero singular values of X; yet in practice, ones optimizes without this 
regularization using the observation that [i can be made arbitrarily small and in practice one always 



observes convergence. Analogous modifications also appear in [10 , 11 30 . 
4.2 Reconstruction on the whole space 



Based on Proposition 4.2 we have that, for a suitable choice of fi, any limit point satisfies 

||PT x ,^ ft Pn(A, - A)|| F <e, e>0. 

In other words, we will have approximately fitted X to the data A on the image of Pt x Mk and, 
hopefully, this is sufficient to have A* = A on the whole space W mxn . Without further assumptions 
on A* and A, however, it is not reasonable to expect that A* equals A since Pq usually has a very 
large null space. 

Let us split the error E := A — A into the three quantities 

E 1 =P TxMk P n (X -A), E 2 = P NxMk P n (X -A), and E 3 = Pq(X — A), 

where Nx-Mk is the normal space of A, i.e., the subspace of all matrices Z G ]g mxn orthogonal to 
T x M k . 

Upon convergence of Algorithm [TJ ||i?||F - > but how big can E 2 be? In general, E 2 can be 
arbitrary large since it is the Lagrange multiplier of the fixed rank constraint of A. In fact, when 
the rank of A is smaller than the exact rank of A, E 2 typically never vanishes, even though E\ 
can converge to zero. On the other hand, when the ranks of A and A are the same, one typically 
observes that H-EHf —> implies \\E\\p — > 0. This can be easily checked during the course of the 
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algorithm, since Pn(A — A) = E\ + E2 is computed explicitly for the computation of the gradient 
in Algorithm [2j 

Suppose now that ||-Ei||f < £ an d H-E^Hf < T , which means that X is in good agreement with 
A on f2. The hope is that X and A agree on the complement of £1 too. This is of course not 
always the case, but it is exactly the crucial assumption of low-rank matrix completion that the 
observations on Q are sufficient to complete A. To quantify this assumption, one can make use of 
standard theory in matrix completion literature; see, e.g., 16 T8| 3l] . Since it is not the purpose 
of this paper to analyze the convergence properties of Algorithm TJ for various, rather theoretical 
random models, and we do not rely on this theory in the numerical experiments later on, we omit 
this discussion. 



5 Numerical experiments 

The implementation of LRGeomCG, i.e., Algorithm [TJ was done in Matlab R2012a on a desktop 
computer with a 3.10 GHz CPU and 8 GB of memory. All reported times are wall-clock time that 
include the setup phase of the solvers (if needed) but exclude setting up the (random) problem. 



As observed in 10 , the performance of most matrix completion solvers behaves very differently 
for highly non-square matrices; hence, as is mostly done in the literature, we focus only on square 
matrices. In our numerical experiments below we only compare with LMAFit of |56| . Extensive 
comparison with other approaches based in [l5 24 
fastest method for square matrices. We refer to 



45 



38-41,53 revealed that LMAFit is overall the 



for these results. 



The implementation of LMAFit was the one provided by the authors^ All options were kept the 



same, except rank adaptivity was turned off. In Sections 5.4 and |5.5| we also changed the stopping 
criteria as explained there. In order to have a fair comparison between LMAFit and LRGeomCG, 
the operation Pq was performed by the same Matlab function part_XY.m in both solvers. In 
addition, the initial guesses were taken as random rank k matrices (more on this below). 

In the first set of experiments, we will complete random low-rank matrices which are generated 



as proposed in 15 . First, construct two random matrices Al,Ar G M nxfc with i.i.d. standard 
Gaussian entries; then, assemble A := A^A^ finally, the set of observations Q is sampled uniformly 
at random among all sets of cardinality |f2|. The resulting observed matrix to complete is now 
Aq := Pq(A). Standard random matrix theory asserts that \\A\\p ~ n\/k and ||j4oJ|f — \/|0|A;. In 
the later experiments, the test matrices are different and their construction will be explained there. 
When we report on the relative error of an approximation X, we mean the error 

relative error = \\X - A\\ F /\\A\\ F (22) 

computed on all the entries of A. On the other hand, the algorithms will compute a relative residual 
on Q only: 

relative residual = \\P Q {X - A)\\ F /\\P Q (A)\\ F . (23) 

Unless stated otherwise, we set as tolerance for the solvers a relative residual of 10 -12 . Such a 
high precision is necessary to study the asymptotic convergence rate of the solvers, but where 
appropriate we will draw conclusions for moderate precisions too. 

As initial guess X\ for Algorithm [TJ we construct a random rank-/c matrix following the same 
procedure as above for M. The oversampling factor (OS) for a rank-/c matrix is defined as the ratio 



1 Version of August 14, 2012 downloaded from |http: //lmaf it . blogs . rice . edu/| 
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of the number of samples to the degrees of freedom in a non-symmetric matrix of rank k, 

0S = |fi|/(fc(2n-Jfe)). (24) 

Obviously one needs at least OS > 1 to uniquely complete any rank-fc matrix. In the experiments 
below, we observe that, regardless of the method, OS > 2 is needed to reliably recover an incoherent 
matrix of rank k after uniform sampling. In addition, each row and each column needs to be 
sampled at least once. By the coupon's collector problem, this requires that |fi| > Cnlog(n) but 
in all experiments below, is always sufficiently large such that each row and column is sampled 
at least once. Hence, we will only focus on the quantity OS. 



5.1 Influence of size and rank for fixed oversampling 

As first test, we complete random matrices A of exact rank k with LRGeomCG and LMAFit 
and we explicitly exploit the knowledge of the rank in the algorithms. The purpose of this test 
is to investigate the dependence of the algorithms on the size and the rank of the matrices. In 
Section |5.3[ we will study the influence of OS, but for now we fix the oversampling to OS = 3. 
In Table [TJ we report on the mean run time and the number of iterations taking over 10 random 
instances. The run time and the iteration count of all these instances are visualized in Figures [3] 
and[3J respectively. 

Overall, LRGeomCG needs less iterations than LMAFit for all problems. In Figure [3j we 
see that this is because LRGeomCG converges faster asymptotically although the convergence of 
LRGeomCG exhibits a slower transient behavior in the beginning. This phase is, however, only 
limited to the first 20 iterations. Since the cost per iteration for LMAFit is cheaper than for 



LRGeomCG (see the estimates (20) and (21)), this transient phase leads to a trade-off between 
runtime and precision. This is clearly visible in the timings of Figure [4j When sufficiently high 
accuracy is requested, LRGeomCG is always faster, whereas for low precision, LMAFit is faster. 

Let us now check in more detail the dependence on the size n. It is clear from the left sides 
of Table [l] and Figure [3] that the iteration counts of LMAFit and LRGeomCG grow with n but 
seem to stagnate as n — > oo. In addition, LMAFit needs about three times more iterations than 
LRGeomCG. Hence, even though each iteration of LRGeomCG is about two times more expensive 
than one iteration of LMAFit, LRGeomCG will eventually always be faster than LMAFit for 
sufficiently high precision. In Figure |4] on the left, one can, for example, see that this trade-off 
point happens at a precision of about 10 -5 . For growing rank k, the conclusion of the same 
analysis is very much the same. 

The previous conclusion depends critically on the convergence factor and, as we will see later 
on, this factor is determined by the amount of oversampling OS. But for difficult problems (OS = 3 
is near the lower limit of 2), we can already observe that LRGeomCG is about 50% faster than 
LMAFit. 



5.2 Hybrid strategy 

The experimental results from above clearly show that the transient behavior of LRGeomCG's 
convergence is detrimental if only modest accuracy is required. The reason for this slow phase is 
the lack of non-linear CG acceleration {(3 in Algorithm [4] is almost always zero) which essentially 
reduces LRGeomCG to a steepest descent algorithm. On the other hand, LMAFit is much less 
affected by slow convergence during the initial phase. 
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Figure 3: Convergence curves for LMAFit (dashed line) and LRGeomCG (full line) for fixed over- 
sampling of OS = 3. Left: variable size n and fixed rank k = 40; right: variable ranks k and fixed 
size n = 8000. 



Table 1: The mean of the computational results for Figure |3f|4| for solving with a tolerance of 10 





Fixed rank k 


= 40 
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size n = 


8000 






LMAFit 


LRGeomCG 


LMAFit 


LRGeomCG 




n 


time(s.) 


#its. 


time(s.) 


#its. 


k 


time(s.) 


#its. 


time(s.) 


#its. 
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3.11 
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2.74 


54.5 
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8.81 
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14.8 


66.7 
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25.8 


76.1 
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53.3 


211 


36.5 


71.7 


40 


52.7 


211 


35.8 


71.7 


16000 
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222 


99.1 


75.4 


50 


76.1 


194 


51.2 


67.7 


32000 


383 


233 


254 


79.1 


60 


96.6 


186 


67.0 


66.1 



A simple heuristic to overcome this bad phase is a hybrid solver: For the first I iterations, we 
use LMAFit; after that, we hope that the non-linear CG acceleration kicks in immediately and we 
can efficiently solve with LRGeomCG. In Figure [5j we have tested this strategy for different values 
of / on a problem of the previous section with n = 16000 and k = 40. 

From the left panel of Figure [HJ we get that the run time to solve for a tolerance of 10~ 12 
is reduced from 97 sec. (LRGeomCG) and 147 sec. (LMAFit) to about 80 sec. for the choices 
I = 10, 20, 30, 40. Performing only one iteration of LMAFit is less effective. For / = 20, the hybrid 
strategy has almost no transient behavior anymore and it is always faster than LRGeomCG or 
LMAFit alone. Furthermore, the choice of I is not very sensitive to the performance as long as it is 
between 10 and 40, and already for I = 10, there is a significant speedup of LRGeomCG noticeable 
for all tolerances. 

The quantity /3 of PR+ in Algorithm [4] is plotted in the right panel of Figure [5j Until the 
20th iteration, plain LRGeomCG shows no meaningful CG acceleration. For the hybrid strategies 
with I > 10, the acceleration kicks in almost immediately. Observe that all approaches converge 
to /3 ~ 0.4. 
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Figure 4: Timing curves for LMAFit (dashed line) and LRGeomCG (full line) for fixed oversampling 
of OS = 3. Left: variable size n and fixed rank k = 40; right: variable ranks k and fixed size 
n = 8000. 



While this experiment shows that there is potential to speed up LRGeomCG in the early phase, 
we do not wish to claim that the present strategy is very robust nor directly applicable. The 
main point that we wish to make, however, is that the slow phase can be avoided in a relatively 
straightforward way, and that LRGeomCG can be warm started by any other method. In particular, 
this shows that LRGeomCG can be efficiently employed once the correct rank of the solution 
X is identified by some other method. As explained in the introduction, there are numerous 
other methods for low-rank matrix completion that can reliably identify this rank but have a slow 
asymptotic convergence. Combining them with LRGeomCG seems like a promising way forward. 

5.3 Influence of oversampling 

Next, we investigate the influence of oversampling on the convergence speed. We took 10 random 
problems with fixed rank and size, but vary the oversampling factor OS using 50 values between 
1, . . . , 15. In Figure [6] on the left, the asymptotic convergence factor p is visible for LMAFit and 
LRGeomCG for three different combinations of the size and rank. Since the convergence is linear 
and we want to filter out any transient behavior, this factor was computed as 

_ / iip n (^ end -A)i| F y/^- 10 ) 

9 V ||PnPrio-A)|| F J 

where i en d indicates the last iteration. A factor of one indicates failure to converge within 4000 
steps. One can observe that all methods become slower as OS — > 2. Further, the convergence factor 
of LMAFit seems to stagnate for large values of OS while LRGeomCG's actually becomes better. 
Since for growing OS the completion problems become easier as more entries are available, only 
LRGeomCG shows the expected behavior. In contrast to our parameter-free choice of non-linear 
CG, LMAFit needs to determine and adjust a certain acceleration factor dynamically based on 
the performance of the iteration. We believe the stagnation in Figure [6] on the left is due to a 
suboptimal choice of this factor in LMAFit. 
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Figure 5: Timing curve and (5 coefficient of the hybrid strategy for different values of /. 



In the right panel of Figure [6j the mean time to decrease the relative residual by a factor of 10 
is displayed. These timings were again determined by neglecting the first 10 iterations and then 
interpolating the time needed for a reduction of the residual by 10. Similarly as before, we can 
observe that since the cost per iteration is cheaper for LMAFit, there is a range for OS around 
7... 9 where LMAFit is only slightly slower than LRGeomCG. However, for smaller and larger 
values of OS, LRGeomCG is always faster by a significant margin (observe the logarithmic scale). 



5.4 Influence of noise 

Next, we investigate the influence of noise by adding random perturbations to the rank-/c matrix 
A. We define the noisy matrix A& with noise level e as 



where A is a standard Gaussian matrix and Q is the usual random sampling set (see also 53 56] 
for a similar setup). The reason for defining A^ e > in this way is that we have 

|| P n (A - A^)\\ F = e J^fcl Pn(A0l| F * e y/\Q\k. 



IF 

So, the best relative residual we may expect from an approximation X opt ~ A is 

|| P a {X^ - A^)\\ F /\\ Pn(A)\\ F * || P n (A - A&)\\ F /\\ P a (A)\\ F ~ e. (25) 
Similarly, the best relative error of X opt should be on the order of the noise ratio too since 

\\A- A&\\ F /\\A\\ F ~e. (26) 



Due to the presence of noise, the relative residual (22) cannot go to zero but the Rieman- 
nian gradient of our optimization problem can. In principle, this suffices to detect convergence of 
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Figure 6: Asymptotic convergence factor p (left) and mean time to decrease the error by a factor 
of 10 (right) for LMAFit (dashed line) and LRGeomCG (full line) in function of the oversampling 
rate OS. 



LRGeomCG. In practice, however, we have noticed that this wastes a lot of iterations in case the 
iteration stagnates. A simply remedy is to monitor 



relative change at step i 



1 - (27) 



and stop the iteration when this value drops below a certain threshold. After some experimentation, 
we fixed this threshold to 10 -3 . Such a stagnation detection is applied by most other low-rank 



completion solvers 53 56 , although its specific form varies. Our choice coincides with the one 
from |56] , but we have lowered the threshold. 

After equipping LMAFit and LRGeomCG with this stagnation detection, we can display the 
experimental results on Figure [7] and Table [2} We have only reported on one choice for the rank, 
size and oversampling since the same conclusions can be reached for any other choice. Based on 
Figure [7| it is clear that the stagnation is effectively detected by both methods, except for the very 
noisy case of e = 1. (Recall that we changed the original stagnation detection procedure in LMAFit 
to ours of ( [27] ), to be able to draw a fair comparison.) 

Comparing the results with those of the previous section, the only difference is that the iteration 
stagnates when the error reaches the noise level. In particular, the iterations are undisturbed by 
the presence of the noise up until the very last iterations. In Table [2j one can observe that both 



methods solve all problems up to the noise level, which is the best that can be expected from (25) 



and (26). Again, LRGeomCG is faster when a higher accuracy is required, which, in this case, 
corresponds to small noise levels. Since the iteration is unaffected by the noise until the very end, 
the hybrid strategy of the previous section should be effective here too. 



5.5 Exponentially decaying singular values 

In the previous problems, the rank of the matrices could be unambiguously defined. Even in the 
noisy case, there was still a sufficiently large gap between the original nonzero singular values and 
the ones affected by noise. In this section, we will complete a matrix for which the singular values 
decay but do not become exactly zero. 
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Timing curve: n = 8000, k = 20, OS = 3 
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Figure 7: Timing curves for LMAFit (crosses) and LRGeomCG (circles) for different noise levels 
with size n = 8000, rank k = 20, fixed oversampling of OS = 3. The noise levels e are indicated by 
dashed lines in different color. 

Table 2: Computational results for Figure [7} 
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Although a different setting than the previous problems, this type of matrices occurs frequently 
in the context of approximation of discretized functions or solutions of PDEs on tensorized domains; 
see, e.g., [33J . In this case, the problem of approximating discretized functions by low-rank matrices 
is primarily motivated by reducing the amount of data to store. On a continuos level, low-rank 
approximation corresponds in this setting to approximation by sums of separable functions. As 
such, one is typically interested in approximating the matrix up to a certain tolerance using the 
lowest rank possible. 

The (exact) rank of a matrix is now better replaced by the numerical rank, or e-rank |25[ Chapter 
2.5.5]. Depending on the required accuracy e, the e-rank is the quantity 

k e (A) = min rank(S). (28) 

||A-B||2<£ 

Obviously, when e = 0, one recovers the (exact) rank of a matrix. It is well known that the e-rank 
of A £ u mxn can be determined from the singular values Oi of A as follows 

o-i > o"fc e > e > (Tke+i > • ■ • > fp, P = min(m, n). 
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In the context of the approximation of bivariate functions, the relation ( |28| ) is very useful in 
the following way. Let f(x,y) be a function defined on a tensorized domain (x,y) £ X x y. After 
discretization of x and y, we arrive at a matrix A. If we allow for an absolute error in the spectral 



norm of the size e, then (28) tells us that we can approximate M by a rank k £ matrix. 



As example, we take the following simple bivariate function to complete: 

f(x,y) 



o~ + \\x 



112 ' 

v\\i 



(29) 



for some a > 0. The parameter a controls the decay of the singular values. Such functions occur 
as two-point correlations related to second-order elliptic problems with stochastic source terms; 
see [27]. 

We ran LRGeomCG and LMAFit after discretizing (29) with a matrix of size n = 8000 using 
an oversampling factor OS of 8 (calculated for rank 20) and a decay of a = 1. Since a stopping 
condition on the residual has no meaning for this example, we only used a tolerance of 10 -3 on the 



relative change (27) or a maximum of 500 iterations. In Figure |8j the final accuracy for different 
choices of the rank is visible for these two problem sets. 

The results labeled "(no horn)" use the standard strategy of starting with random initial guesses 
for each value of the rank k. Clearly, this results in a very unsatisfactory performance of LRGeomCG 
and LMAFit since the error of the completion does not decrease with increasing rank. Remark that 
we have measured the relative error as 

||P r (X.-A)|| P /||P r (A)[| F 

where T is a random sampling set different from f2 but equally large. 
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Figure 8: Relative error of LMAFit and LRGeomCG after completion of a discretization of (29) 
for OS=8 with and without a homotopy strategy. 

The previous behavior is a clear example that the local optimizers of LMAFit and LRGeomCG 
are very far away from the global ones. However, there is a straightforward solution to this problem: 
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Instead of taking a random initial guess for each k, we only take a random initial guess for k = 1. 
For all other k > 1, we use X = UT,V T where 



X = UTiV T is the local optimizer for rank k — 1 and u and v are random unit-norm vectors 
orthogonal to U and V, respectively. We call this the homotopy strategy which is also used in 
one of LMAFit's rank adaptivity strategies (but with zero vectors u and v). The error using this 
homotopy strategy is labeled "(horn)" in Figure [8] and it clearly performs much more favorably. In 
addition, LRGeomCG is able to obtain a smaller error. 



V 



V v 



6 Conclusions 

The matrix completion consists of recovering a low-rank matrix based on a very sparse set of entries 
of this matrix. In this paper, we have presented a new method based on optimization on manifolds 
to solve large-scale matrix completion problems. Compared to most other existing methods in the 
literature, our approach consists of directly minimizing the least-square error of the fit directly over 
A4k, the set of matrices of rank k. 

The main contribution of this paper is to show that the lack of vector space structure of Aik 
does not need be an issue when optimizing over this set, and illustrating this for low-rank matrix 
completion. Using the framework of retraction-based optimization, the necessary expressions were 
derived in order to minimize any smooth objective function over the Riemannian manifold Mk- The 
geometry chosen for Aik, namely a submanifold embedded in the space of matrices, allowed for an 
efficient implementation of non-linear CG for matrix completion. Indeed, the numerical experiments 
illustrated that this approach is very competitive with state-of-the-art solvers for matrix completion. 
In particular, the method achieved very fast asymptotic convergence factors without any tuning of 
parameters thanks to a simple computation of the initial guess for the line search. 

A drawback of the proposed approach is however that the rank of the manifold is fixed. Although 
there are several problems where choosing the rank is straightforward, an integrated manifold-based 



solution, like in 19,46 , is desirable. In addition, our convergence proof relied on several safeguards 
and modifications in the algorithm that seem completely unnecessary in practice. Closing these 
gaps are currently topics of further research. 
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A Proof of Proposition 2.3 



The proof of Prop. |2.3| consists of two steps. First, we introduce a second-order retraction which 
allows us to use standard Euclidean derivatives to derive the Riemannian Hessian of any objective 
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function. Then, we derive a second-order model of the specific function / based on a second- 
order Taylor expansion using this retraction. This second-order model will give us the Riemannian 
Hessian. 



A.l An explicit second-order retraction on Aik 

The Riemannian Hessian of an objective function / is usually defined in terms of the Levi-Civita 
connection, but in case of an embedded submanifold it can also be defined by means of so-called 
second-order retractions. 

Second-order retractions are defined as second-order approximations of the exponential map (2J 
Proposition 3]. In case of embedded submanifolds, it is shown in |2j that they can be identified as 
retractions that additionally satisfy 

F ™ (^ Rx ^\t=o) = °' f ° r aU * G TxMk ' (30) 
Now, the Riemannian Hessian operator, a symmetric and linear operator 

Hess/(X) :T x M k ^T x M k , 

can be determined from the classical Hessian of the function / o R x : TxMk K defined on a 
vector space since 

Hess/(X) = Hess(/oii x )(0), 

where Rx is a second-order retraction, see (2j Proposition 5.5.5]. 

The next proposition gives a second-order retraction and the plan of this construction is as 
follows. First, we construct a smooth mapping TxMk M mxn ; then, we prove that it satisfies the 
requirements in Definition 2.2 and (30); and finally, we expand in series. Since the metric projection 
(12) can be shown to satisfy (30) also, we have in addition obtained a second-order Taylor series 
for (ph. 



Proposition A.l For any X = UT>V T G Aik satisfying ^ and £ G TxMk satisfying the 
\ by 

R® :T x M k ^Mk, £ = UMV T + U P V T + UV? ^ Z V Z%, (31) 



mapping R x \ given by 



where 

Z v := C/(E + \M - lMY,- l M) + U p (I k - §E -1 M), (32) 
Z v := V{I k + \M T YT T - \M T YT T M T Yr T ) + V V (^T T - \Yr T M T YT T ), (33) 

is a second-order approximation of the exponential map on {g X ,Mk)- Furthermore, we have 

R%\t)=X + t + U p X- 1 vT + 0(||£|| 3 ), IKII^O, (34) 

and 

R ( *\0-Rx(0 = o(U\\ 3 ), UW^o, (35) 



with Rx the metric projection (12). 
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Before proving Proposition |A.1[ we first state the following lemma that shows smoothness of 
R x in £ for fixed X. 

Lemma A. 2 Using the notation of Proposition pO[ we have 



i?<?(0 = WX*W, 



(36) 



with 



and 



W:=x + ±p x (0 + p x (0 

- Ip s x (0^p x (0 - ip^(0^P3f(0 - J Pi (0^1^(0, 



P^r^^TjMfciZ^PcZPv, (37) 
P p x :R nxm ^T x M k : Z^P^ZP v + PuZP v . (38) 

Furthermore, for fixed X £ .Mfc, i/ie mapping R x (£) is smooth in £ as a mapping from T x Aik to 



Proof Substituting Jft = VE" 1 */ 7 and the definitions (1371),^ for P S X ,P X in W gives 



W = UHV T + \UMV T + U P V T + VVjf - \UMV T VYr x U T UMV T 

- Mu p v T + uvI)vh- x u t umv t - \umv t vyt x u t u v v t + uvl, 



UZV T + \UMV T + U P V T + UVp 



1 



UMT, 



\U v Yr x MV T 



\UMYT X V^. 



This shows the equivalence of (36) to (31) after some tedious but straightforward algebraic manip- 
ulations. 



Observe that (36) consists of only smooth operations involving X: taking transposes, multi- 



plying matrices and inverting full rank matrices can all be expressed as rational functions without 

(2) 

singularities in the matrix entries. From this, we have immediately smoothness of R x (£) in £. 

□ 

i mxn . For Proposition 



(2) 

The previous lemma shows smoothness of R x : T X M.\ Z 
need to prove local smoothness of R^ as a function from the tangent bundle onto Mk- We will 
do this by relying on Boman's theorem j9j Theorem 1] , which states that a function / : ~R d — > M is 
smooth if / o j is smooth along all smooth curves 7 : M. — > R d . 



A.l 



we also 



In addition, for the proof below we will allow the matrix £ in definition (31 ) to be any full-rank 



matrix. This is possible because none of the derivations above relied on £ being diagonal and 



(2) 

definition (31 ) for R x is independent of the choice of specific matrices U, V, £ in the factorization 



of X. Indeed, suppose that we would have chosen U = UQu, V = VQv an d £ = QjjY*Qy as 
factorization matrices. Then the coefficient matrices of the tangent vector £ would have been M = 
QjjMQy, Up = UpQv and V = V p Qu- Now, it is straightforward to check that the corresponding 
expressions for (32)-(33) become = ZjjQv and Z y = ZyQy. Hence, R x (£) = ZjjZ'- = ZjjZy 
stays the same. 
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Proof of Proposition A.l First, we show (34). Elementary manipulation using (32)-(33) reveals 



that 



S + M + 0(||M|| 3 ) I + 0{\\M\\ 2 )' 
I + 0(\\M\\ 2 ) S-i + OdlMH) 



[V V P ] T , \\M\ 



0. 



For fixed X, the norms ||M||, \\U P \\, \\V P \\ are all 0(||£||), and so 



R ( x\0 = u & + M)V T + UV p T + U P V T + U P Y, 

(2) 



v p T + o(Uf), na^o, 



2.2 



which can be rewritten as (|34j). Next, we show that R x is a retraction according to Definition 
By construction, the image of Rx is a matrix ZjjZy of rank at most k. Since rank(X) is continuous 

(2) 

in X, one can take St > sufficiently small such that the image of t \- > R x (t£) is always of rank k 
for all t £ (-St, S t ). 

(2) 

Finally, we establish smoothness of R x in a neighborhood of (X, 0) E TA4 k - Take any (Y, £) 6 
TMk sufficiently close to (X, 0) and consider any smooth curve 7 : M — > TAik that connects 
(X, 0) to (Y,£). Consider the partitioning 7(i) = (a(t), j3(t)) with a a smooth curve on Aik and 
/3 on T a ^Ai k . By the existence of a smooth SVD for fixed-rank matrices [20| Theorem 2.4], there 
exists smooth U(t), V(t), E(i) (like mentioned above , is now a full-rank matrix) such that 

U (t)T,(t)V (t) T . Furthermore by Lemma 



a(t) 



(2) 

it will be smooth along R a ^((3(t)). Then, the expression R a ^(P(t)) consists of smooth matrix 
operations that depend smoothly on t. Since the curve *y(t) = (a(t), f3(t)) was chosen arbitrarily, we 
have shown smoothness of Rx(0 on the tangent bundle in a neighborhood of (X, 0). To conclude, 
observe from (34) that R x obeys (30) since UpYt^Vp is orthogonal to the tangent space. This 



A.2 



(2) 

map R a / t \(0 is smooth in £ around zero, so 



(2) 

makes R x a second-order retraction. 



□ 

Although parts of the pre viou s results can also be found in 50 , Theorem 1] , we prove additionally 

(2) I I — 

• " n - i; " ;i 2.2 by showing smoothness on the whole tangent bundle (and not only 



that R x ' satisfies Definition 
for each tangent space separate 



y, as in |50 



A.2 Second-order model of / on Aik 

(2) 

Based on the Taylor series of R' x ', it is now straightforward to derive the second-order model 

m(X) = f(X) + (grad/(X), + ^(Hess f(X)[£], 

of any smooth objective function / : Aif. — >■ R. In this section, we will illustrate this with our 
objective function f(X) = ±|| P Q (X - A)||f . 

(2) 1 1 

Expanding f(R" x '(0) 

using (I34J) gives for ||£|| — » 
f(R i ?(0)= 1 2 \\Pn(R i x ) (0-A)\\ 2 F 

= 1 -\\P n (x + c + u v ^- x v v T - A)ft + 0(U\\ 3 ) 

= I tr[P n (X - A) T P n (X -A) + 2 P n (X - A) T P n (£) 



+ 2P Q (X - AfPn^pE- 1 ^) + Pn(0 T ?n(0] + 0(ll£ll 3 )- 
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We can recognize the constant term 

f(X) = ^tv[P n (X-A) T Pa(X-A)], 

the linear term 

(grad/(X), = tv[P Q (X - A) T P n (0], (39) 

and the quadratic term 

(Hess /(*)[£], = tv[2P n (X -AfPniUp^V^+Pn^fPniO}- (40) 



Using ( 11 ), we see that our expression for the Riemannian gradient, grad f(x) = PT x M k (Pn(X— 
A)), indeed satisfies (39). This only leaves determining the Riemannian Hessian as the symmetric 
and linear mapping 

Hess/(X) : T x M k -> T x M k , £ ^ Hess /(*)[£], 
that satisfies the inner product (40). First note that U p T,~ 1 V p T is quadratic in £ since 

u p ?r x v£ = (u p v T + uv p T )(vx- 1 u T )(u p v T + uvj) = p x {£>^ p*(0- 

Now we get 

(Hess /(*)[£], = 2(P^(X - A), P n (I* (flJftpP + <P„(fl, P n (0) 
= 2(P n (X - A), P\ (O^t I* (0> + (Pn(0, 0- 

S v ' S v ' 

A:=(Wx(0,€> ya:=(«a(0,0 

So, the inner product with the Hessian consists of two parts, /i and /2, and the Hessian itself is 
the sum of two operators, Hi and %2- The second contribution to the Hessian is readily available 
from fi\ 

h = (Pn(0, = ( p n(0, Pt x m*(0> = (^T x M k ^(0, 0, 

and so 

%(£) := P Tx ^ fc Pn(0 = Pi/ Pn(£) Py + Pfj Pn(0 ?v + Pi/ Pn(0 Py • 

In fi we still need to separate £ to one side of the inner product. Since we can choose whether we 
bring over the first P^ ^ (£) or the second, we introduce a constant c£K, 

h = 2c (Pn(X - M) ■ (X^ TxMk (0f, P P TxMk (0) 

+ 2(1 - c) ((P p TxMk (0xY ■ P n (X - A), P p TxMk (0) 
= 2c (pP TxMh [Pn(X - A) ■ P p TxMk (0 T ■ (XY], 
+ 2(1 - c) (P p TxMk [(xY ■ P p TxMk (0 T ■ ?n(X - A)], 0. 



The operator Hi is clearly linear. Imposing the symmetry requirement (Hi(£), ^) = (£, H\{y)) for 
any arbitrary tangent vector f, we get that c = 1/2, and so 



*i(fl = P T x ^jPn(^ - A)pP xA , fc (0 T (Xtf + (Xt) r P£ xMfc (fl T P n (X - A)] 
= P^ P n (X - A)y p E-V T + f/S" 1 ^ P^(X - A) P^ . 
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Finally, we can put together the whole Hessian 

Hess f(X)[Z] = P v P n (£) P v + P£(Pn(0 + ?n(X - A)V p T t - x V T ) P v 



+ P!7(P n (0 + UZ^UlPniX - A)) Pi 



and this also proves Prop. 2.3 
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